Exploratory Data Analysis for Kc_House_Data

This is a tutorial notebook on how to perform EDA for a dataset.

The chosen data for this tutorial is House Sales in King County, USA, available on Kaggle.

Check the blog post "How to perform EDA for machine learning?" for more information about the EDA method used in this notebook.

Overall View

Feature Nature:

From the type of the features and their values count, we can determine the nature of each feature:
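As a rough sketch of this step, the snippet below derives a nature for each column from its type and its number of distinct values. The rows are made-up stand-ins for a few kc_house_data columns, and both the `feature_nature` helper and its `max_levels` threshold are hypothetical choices for illustration, not the rule used in the notebook.

```python
import pandas as pd

# Toy rows that only mimic a few kc_house_data columns; all values are made up.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 1225000.0],
    "waterfront": [0, 0, 0, 0, 0, 1],
    "condition": [3, 3, 3, 5, 3, 3],
    "date": ["20141013T000000", "20141209T000000", "20150225T000000",
             "20141209T000000", "20150218T000000", "20140512T000000"],
})


def feature_nature(column, max_levels=3):
    """Hypothetical rule of thumb: few distinct values means categorical."""
    if column.nunique() <= max_levels:
        return "categorical"
    if pd.api.types.is_numeric_dtype(column):
        return "numerical"
    return "text/date"


# Collect type, distinct-value count, and the inferred nature per column.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "nature": {name: feature_nature(df[name]) for name in df.columns},
})
print(overview)
```

On the real dataset the threshold would need to be tuned by looking at the actual value counts, since a fixed cut-off can mislabel low-cardinality numerical features.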

Univariate Analysis

In this step of the EDA, each variable is examined and assessed on its own. This step usually doesn't yield major insights, but it helps in understanding each feature better by visualizing its distribution and examining its statistics.

The 'id' variable is usually ignored because it carries no meaning; it only indexes each row with a unique identifier.
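A minimal sketch of setting 'id' aside before summarizing the remaining features could look like this. The rows are invented for illustration and only reuse a few column names from the dataset.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
df = pd.DataFrame({
    "id": [7129300520, 6414100192, 5631500400, 2487200875],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "bedrooms": [3, 3, 2, 4],
})

# 'id' is only a row identifier, so leave it out of the univariate summary.
features = df.drop(columns=["id"])

# Count, mean, std, min, quartiles, and max for each remaining column.
summary = features.describe()
print(summary)
```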

Date

Remark: Notice that the date type is 'str', so we need to convert it to a timestamp variable, which is achieved using the pandas function pd.to_datetime().
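The conversion can be sketched as follows. The sample strings and the `YYYYMMDDT000000` layout passed as `format` are assumptions for illustration; check them against the actual values in the 'date' column before reusing the format string.

```python
import pandas as pd

# Made-up sample strings in an assumed YYYYMMDDT000000 layout.
df = pd.DataFrame({"date": ["20141013T000000", "20141209T000000", "20150225T000000"]})

# Parse the strings into pandas timestamps using an explicit format.
df["date"] = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")

# Once converted, the .dt accessor exposes year, month, weekday, and so on.
print(df["date"].dt.year.tolist())
```

Passing an explicit `format` makes the parsing stricter than letting pandas infer the layout, so unexpected values raise an error instead of being silently misread.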

Price

Remark: Notice that the distribution of prices is extremely right-skewed, and that we have 1146 outliers out of 21613 entries. Almost 94.7% of the house prices are below 1127500.
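The notebook text does not say how the outliers were counted. One common convention, assumed here, is the 1.5 x IQR rule, sketched below on made-up prices together with a skewness check.

```python
import pandas as pd

# Made-up prices with one very expensive house; not taken from the dataset.
prices = pd.Series(
    [200000, 250000, 300000, 350000, 400000, 450000, 500000, 3000000],
    name="price",
    dtype="float64",
)

# Interquartile range and the usual 1.5 * IQR fences (assumed outlier rule).
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
share_below = (prices <= upper_fence).mean()  # fraction at or below the upper fence
skewness = prices.skew()                      # positive values indicate right skew

print(f"upper fence: {upper_fence:.0f}")
print(f"outliers: {len(outliers)} of {len(prices)}")
print(f"share at or below upper fence: {share_below:.3f}")
print(f"skewness: {skewness:.2f}")
```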

Bedrooms & Bathrooms

As shown in the image above, some homes have three-quarter and half bathrooms. For example, a 1.5 bath means one full bathroom and one half bathroom. A 0.5 bathroom is called a half bath; the term does not refer to its size in square feet. A half bath offers a sink and a toilet but no shower or bathtub. This fractional notation for bathrooms is commonly used in the USA, which is why it appears in this dataset.

sqft_living, sqft_lot, sqft_living15, sqft_lot15, sqft_above, and sqft_basement

sqft_living15 & sqft_lot15: Living area and lot area in 2015, which implies that some renovations may have taken place.

floors, waterfront, view, condition, and grade

yr_built and yr_renovated

Long and Lat

The best practice when dealing with longitude and latitude variables is to plot them on a map to visualize the distribution (scatter) of positions at real scale. This holds for both univariate and multivariate analysis.
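As a lightweight stand-in for a real map, the positions can at least be scattered on longitude/latitude axes with matplotlib. This is only a sketch: the coordinates are made up, the column names 'long' and 'lat' are assumed to match the dataset, and no base map is drawn (that would need a mapping library on top).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up coordinates roughly in the Seattle area; not real listings.
df = pd.DataFrame({
    "lat": [47.5112, 47.7210, 47.7379, 47.5208, 47.6168],
    "long": [-122.257, -122.319, -122.233, -122.393, -122.045],
})

# Longitude on the x-axis and latitude on the y-axis gives a map-like layout.
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(df["long"], df["lat"], s=20, alpha=0.6)
ax.set_xlabel("long")
ax.set_ylabel("lat")
ax.set_title("House positions (no base map)")
```

For the multivariate case, the same scatter can be colored by another feature, for example by passing a price column through the `c` argument of `ax.scatter`.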

Conclusion of univariate analysis

Many of the categorical features in the dataset are heavily unbalanced, such as 'condition', 'view', 'waterfront', and 'floors', which may be the cause of the extreme skewness in the distributions of house prices and areas. These speculations can be further inspected by carrying out a multivariate analysis, which is the subject of the following sections.
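One simple way to quantify how unbalanced a categorical feature is, sketched here on made-up values, is to look at the share of rows taken by its most frequent level.

```python
import pandas as pd

# Made-up values that only reuse two column names from the dataset.
df = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "condition": [3, 3, 3, 3, 3, 3, 4, 4, 5, 3],
})

# value_counts(normalize=True) returns level shares sorted from most to least frequent,
# so the first entry is the share of the dominant level.
dominant_share = {
    column: df[column].value_counts(normalize=True).iloc[0]
    for column in ["waterfront", "condition"]
}

for column, share in dominant_share.items():
    print(f"{column}: dominant level covers {share:.0%} of rows")
```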

Multivariate Analysis

Starting with bivariate analysis: since we have a target variable, the house price, we can limit the bivariate analysis to 'price' vs. all the other significant features. But first, let's take a quick look at the pair scatter plot of the numerical features.
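A numeric complement to the 'price' vs. feature comparison is to rank the features by their Pearson correlation with the target. This is a different technique from the pair scatter plot, it only captures linear relationships, and the rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 1225000.0],
    "sqft_living": [1180, 2570, 770, 1960, 1680, 5420],
    "bedrooms": [3, 3, 2, 4, 3, 4],
    "yr_built": [1955, 1951, 1933, 1965, 1987, 2001],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_price = (
    df.corr()["price"]
    .drop("price")               # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_price)
```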

Conclusion

In this notebook, I presented a simple and methodical way of performing an EDA for structured and clean data. In practice, data is collected in a raw state and needs more cleaning work. The presented EDA was not aimed at a specific task, even though we considered the price feature as a target variable for a regression task; if we were going to build a model, there would be more analysis to do. For instance, we can further inspect the drop in price for houses that have 6.5-7.5 bathrooms, and we can also think about binning some features and rechecking whether a pattern has emerged.
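The binning idea can be sketched with `pd.cut` followed by a per-bin summary of the price. The rows and the bin edges below are made up for illustration and do not reproduce the pattern found in the real data.

```python
import pandas as pd

# Made-up rows; values are chosen for illustration only.
df = pd.DataFrame({
    "bathrooms": [1.0, 1.5, 2.25, 2.5, 3.5, 4.5, 6.75, 7.5],
    "price": [200000.0, 300000.0, 450000.0, 500000.0,
              900000.0, 1500000.0, 800000.0, 450000.0],
})

# Hypothetical bin edges; intervals are right-closed: (0, 2], (2, 4], (4, 6], (6, 8].
bins = [0, 2, 4, 6, 8]
df["bath_bin"] = pd.cut(df["bathrooms"], bins=bins)

# Median price per bin; observed=False keeps empty bins in the result.
median_price = df.groupby("bath_bin", observed=False)["price"].median()
print(median_price)
```

The median is used instead of the mean here because the price distribution is right-skewed, so a few very expensive houses would otherwise dominate each bin.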